This is a side project analyzing the Breast Cancer Wisconsin (Diagnostic) dataset, obtained from Kaggle. I am going to try two different machine learning classification models and compare the results. I've divided my presentation into two sections.
Let’s start by exploring the data.
import pandas as pd
# read the file
df = pd.read_csv("/Users/jiezhao/Downloads/data.csv")
# print the column names of the dataset
print(df.columns)
# check the data set
df
# check whether any column contains missing values
df.isnull().sum()
# the last column ('Unnamed: 32') is almost entirely NaN, so we drop it,
# along with the non-predictive 'id' column, by keeping columns 1 through 31
df1 = df.copy()
df1 = df1.iloc[:, 1:32]
# encode the target: malignant -> 1, benign -> 0
df1['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
df1.head()
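One thing to watch with `map()` is that any label outside the dictionary silently becomes NaN, so it is worth confirming the encoding and the class balance. A minimal sketch on a hypothetical toy frame standing in for the real CSV:

```python
import pandas as pd

# Hypothetical miniature frame standing in for the Kaggle data,
# so the mapping step can be sanity-checked without the file.
toy = pd.DataFrame({"diagnosis": ["M", "B", "B", "M", "B"]})

# Same encoding as above: malignant -> 1, benign -> 0.
toy["diagnosis"] = toy["diagnosis"].map({"M": 1, "B": 0})

# After mapping, the class balance is easy to read off.
print(toy["diagnosis"].value_counts())
```

On the real data, `df1['diagnosis'].isnull().sum()` should be 0 after the mapping; anything else means an unexpected label slipped through.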
From the above we can see that, except for "diagnosis", all features are numerical (float) with widely differing ranges. To model the data effectively, I am going to rescale it.
from sklearn import preprocessing
# standardize the 30 feature columns; leave the 0/1 target untouched
df2 = df1.copy()
df2.iloc[:, 1:31] = preprocessing.scale(df1.iloc[:, 1:31])
df2.head()
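To confirm the rescaling did what we expect, `preprocessing.scale` should leave every column with mean ≈ 0 and unit variance. A sketch on random toy data (the matrix here is made up, not the real features):

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix with an arbitrary mean and spread.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))

# z-score each column, as done to the 30 real features above.
Xs = preprocessing.scale(X)

# Every column should now be centered with unit variance.
print(np.allclose(Xs.mean(axis=0), 0.0, atol=1e-8))
print(np.allclose(Xs.std(axis=0), 1.0, atol=1e-8))
```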
Now let's check the correlation between features so that we can remove multicollinearity if it exists.
import seaborn as sns
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(30, 30))
sns.set(font_scale=1.5)
sns.heatmap(df2.iloc[:, 1:31].corr(), cbar=True, fmt='.2f', annot=True)
plt.show()
sns.pairplot(df2.iloc[:, 1:31])
plt.show()
Clearly, strong collinearity exists between some features. For example, the highest correlations are between the radius, perimeter, and area features (both their _mean and _worst variants) and among the concavity-related features.
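The top pairs can also be extracted programmatically from the correlation matrix rather than read off the heatmap. A sketch on a hypothetical toy frame, with a built-in near-duplicate column standing in for the radius/perimeter relationship:

```python
import numpy as np
import pandas as pd

# Toy frame with one near-duplicate column (hypothetical data).
rng = np.random.default_rng(1)
radius = rng.normal(size=200)
toy = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": radius * 6.28 + rng.normal(scale=0.01, size=200),
    "texture_mean": rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once, then rank.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs.head())
```

On `df2.iloc[:, 1:31]` the same few lines list the most redundant feature pairs directly.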
This multicollinearity could cause machine learning models to fail. To reduce it, we use PCA.
from sklearn.decomposition import PCA
import numpy as np
# determine how many components are needed to describe the data;
# fit PCA on the 30 feature columns only (column 0 is the target)
pca = PCA().fit(df2.iloc[:, 1:31])
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');
plt.show()
pca = PCA(n_components=0.9)
pcanew = pca.fit_transform(df2.iloc[:, 1:31])
print(pcanew.shape)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
pca = PCA(n_components=0.95)
pcanew = pca.fit_transform(df2.iloc[:, 1:31])
print(pcanew.shape)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
Therefore, to explain over 90% of the variance, we need to include the first 7 components; to explain over 95% of the variance, we need the first 11 components.
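The component count can also be read off the cumulative ratios programmatically instead of eyeballing the plot; passing a float to `PCA(n_components=...)`, as done above, does exactly this internally. A sketch with made-up ratios (not the real PCA output):

```python
import numpy as np

# Hypothetical explained-variance ratios, for illustration only.
ratios = np.array([0.50, 0.25, 0.12, 0.08, 0.05])
cum = np.cumsum(ratios)

# Smallest k whose cumulative explained variance reaches 90%:
# argmax returns the first index where the condition holds.
k90 = int(np.argmax(cum >= 0.90)) + 1
print(k90)  # -> 4
```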
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# target vector: the 0/1 'diagnosis' labels
y = df2['diagnosis'].to_numpy()
X = pcanew
print(X.shape)
print(y.shape)
# fix random_state so the split (and the scores below) are reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# train scikit learn model
clf = LogisticRegression()
clf.fit(X_train,y_train)
#print ('score Scikit learn: ', clf.score(X_test,y_test))
# prediction
y_pred =clf.predict(X_test)
#computing and plotting confusion matrix
c_m = confusion_matrix(y_test,y_pred)
print('Logistic Regression:\nconfusion matrix\n', c_m,'\n\n')
ax=plt.matshow(c_m,cmap=plt.cm.Blues)
print('Confusion matrix plot of Logistic regression')
plt.colorbar(ax)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
# classification report
print('\n Classification report \n',classification_report(y_test, y_pred))
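A single train/test split can be noisy, so cross-validation gives a steadier accuracy estimate for the same model family. A sketch using synthetic data from `make_classification` as a stand-in for the PCA-reduced features (the real matrix lives in the notebook session):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the 7-component PCA matrix.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)

# 5-fold cross-validated accuracy of logistic regression.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())
```

On the real data, replacing `X, y` with `pcanew` and the diagnosis labels gives a fold-averaged score to quote instead of a single split's accuracy.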
# Support Vector Classification (SVC)
# Fit an SVC with an RBF (radial basis function) kernel, a commonly used
# kernel, on the training data and predict on the test data.
#
# gamma is a parameter of the RBF kernel and can be thought of as the
# 'spread' of the kernel, and therefore of the decision region. When gamma
# is low, the decision boundary curves gently and the decision regions are
# broad. When gamma is high, the boundary curves sharply, creating islands
# of decision regions around individual data points.
#
# C is the penalty for misclassifying a data point. When C is small, the
# classifier tolerates misclassified points (high bias, low variance).
# When C is large, it is heavily penalized for misclassifications and bends
# over backwards to avoid them (low bias, high variance).
from sklearn.svm import SVC
svc=SVC(C=100,gamma=0.001,kernel='rbf',probability=True)
svc.fit(X_train, y_train)
y_pred_svc =svc.predict(X_test)
# computing and plotting confusion matrix
c_m = confusion_matrix(y_test, y_pred_svc)
print('SVC:\n confusion matrix\n', c_m,'\n\n')
ax=plt.matshow(c_m,cmap=plt.cm.Blues)
print('Confusion matrix plot of SVC')
plt.colorbar(ax)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
# classification report
print('\n Classification report \n',classification_report(y_test, y_pred_svc))
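The C and gamma values above were fixed by hand; a small grid search is the usual way to pick them. A sketch on synthetic data (`make_classification` standing in for the PCA-reduced features), searching a grid around the hand-picked values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in for the PCA-reduced feature matrix.
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# Grid around the hand-picked C=100, gamma=0.001 used above.
param_grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Swapping in the real training data would tell us whether the hand-picked pair is actually the best cell of the grid.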
Conclusion: The feature analysis showed strong collinearity among several features, for example concave points_worst, concavity_worst, concavity_mean, perimeter_worst, area_worst, radius_worst, perimeter_mean, area_mean, and radius_mean; the PCA results confirmed these observations. After removing the multicollinearity with PCA, we were able to predict malignant versus benign tumors with high accuracy using both models. As the results show, SVC and logistic regression perform equally well.